CW ADD: Adding a ‘TOT_PRCP’ column that sums SNOW and PRCP to give
total precipitation. This column will be used in Question 3.
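A minimal sketch of how such a column might be built (assuming the dataframe carries PRCP and SNOW columns as elsewhere in this report, and treating missing values as zero; the helper name is hypothetical):

```r
# Hypothetical helper: total precipitation as the sum of rain (PRCP) and
# snowfall (SNOW), with NA values treated as zero so the sum is defined.
add_tot_prcp <- function(df) {
  prcp <- ifelse(is.na(df$PRCP), 0, df$PRCP)
  snow <- ifelse(is.na(df$SNOW), 0, df$SNOW)
  df$TOT_PRCP <- prcp + snow
  df
}
```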
Local weather and global and local human environmental
footprint
Emily will redo the linear regression using measures of local and
global human activity as regressors rather than year. She may also
look into variable transformations (e.g., linear models fit to
polynomials of the regressors) to see whether the response is better fit
as linear or polynomial.
At the end of our exploratory data analysis, we developed a linear
model of maximum daily temperature over time, with year as a linear
regressor. This revealed to us that there is a statistically significant
increase in average maximum temperatures over time. However, we do not
suspect that time itself is the cause; rather, something else that has
changed over time has caused the warming in New York. We wanted to
explore correlations with other, more direct proxies for human
activity.
Our original fit used year as a numerical regressor and month as a
categorical regressor. The resulting fit has an r-squared value of 0.775
and a slope of 0.025 degrees Fahrenheit per year, with all fit
parameters’ p-values well below 0.05. The different intercepts for
each level of the categorical variable (the twelve months of the year)
indicated that January is the coldest and July the hottest month in
Central Park, with an average difference in maximum daily temperature of
approximately 46 degrees Fahrenheit in any given year over this
window.
maxTfit00_ym <- lm(formula = TMAX ~ year + month, data = NYweath_00a )
res00_ym <- residuals(maxTfit00_ym)
summary(maxTfit00_ym)
The two extremes and their linear models are plotted in the following
figure.
#plot of just July and January
ggplot(NYweath_00a, aes(x = year, y = TMAX, color = month)) +
  geom_point() +
  scale_color_manual(values = c("01" = "purple4",
                                "07" = "red"), na.value = NA) +
  geom_abline(aes(intercept = -11.05508, slope = 0.02539), col = "black", size = 1) +
  geom_abline(aes(intercept = 34.98295, slope = 0.02539), col = "black", size = 1) +
  labs(x = "Year",
       y = "Maximum daily temperature",
       title = "Maximum Daily Temperature in Central Park")

Do other weather variables correlate to TMAX?
NYweath_cor <- subset(NYweath_00a, select = c(year, TMAX, PRCP, SNOW))
str(NYweath_cor)
weathcor <- cor(NYweath_cor, use = "pairwise.complete.obs")
corrplot::corrplot(weathcor)

We have found a reasonable linear model for temperature over time,
but can we look instead at the connection to human activities, rather
than time? Can we use some aspect of human activity as a regressor and
generate a reasonable model?
We looked to the Census for U.S. population data, but that is only
reported decennially, so we sought other sources. We found annual
historical data back to 1960 for New York State at
https://www.macrotrends.net/cities/23083/new-york-city/population.
Because this source was not previously known to us, we validated it
against decennial census data.
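One way to carry out that validation (a sketch with a hypothetical helper; the column names and the 2% tolerance are our own choices, not from the source) is to join the annual series to the census values in census years and inspect the relative error:

```r
# Hypothetical sketch: compare an annual population series against
# decennial census values; rel_err is the relative disagreement per year.
validate_population <- function(annual, census) {
  merged <- merge(annual, census, by = "year",
                  suffixes = c("_src", "_census"))
  merged$rel_err <- abs(merged$pop_src - merged$pop_census) / merged$pop_census
  merged
}
```

A small relative error at every census year (we treated 2% as an informal threshold) would support using the annual series as a regressor.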
We fit several candidate linear models:
#LM1
maxTfit00_m <- lm(formula = TMAX ~ month, data = NYweath_00a)
summary(maxTfit00_m)
#LM4
maxTfit00_all <- lm(formula = TMAX ~ year + month + PRCP, data = NYweath_00a)
summary(maxTfit00_all)
#maxTfit00_all_intrxn <- lm(formula = TMAX ~ year + month*day + PRCP + SNOW, data = NYweath_00a)
#summary(maxTfit00_all)
#anova(maxTfit00_m, maxTfit00_ym)
#anova(maxTfit00_all, maxTfit00_all_intrxn)
#LM2
maxTfit00_pop <- lm(formula = TMAX ~ Pop + month, data = NYweath_00a)
summary(maxTfit00_pop)
#maxTfit00_pop_all <- lm(formula = TMAX ~ Pop + month + PRCP, data = NYweath_00a)
#summary(maxTfit00_pop)
#plot of NYS Pop over time
ggplot(NYweath_00a, aes(x = year, y = Pop)) +
  geom_point() +
  labs(x = "Year",
       y = "New York State Population",
       title = "Annual Population in New York State")

Relationships between Local weather and air quality
The AQI is an index for reporting daily air quality. It tells how
clean or polluted the air is, and what associated health effects might
be a concern for the public. The higher the AQI value, the greater
the level of air pollution and the greater the health concern. Outdoor
concentrations of pollutants such as Ozone, Carbon Monoxide, Nitrogen
dioxide, Sulfur Dioxide, and PM2.5/PM10 concentrations are measured at
stations across New York City and reported to the EPA. The Daily Air
Quality Index (AQI) is calculated based on these concentration values
and stored within the EPA’s Air Quality System database.
As city life changes, so does its air quality. Sources of emissions
such as traffic and burning of fossil fuels for energy generation can
cause air quality to deteriorate. Emissions can also contribute to
global warming by releasing more greenhouse gasses into the atmosphere,
thus increasing average temperatures. As more people migrate to urban
areas, we will continue to see a deterioration in air quality unless
measures are taken. The goal of integrating these data is to study the
effects of weather patterns on air quality, and to statistically verify
changes in air quality over time in New York City.
The dataset contains about 7,000 observations collected from January
2000 to October 2022.
We start by looking at the distribution of our variable of interest:
AQI.
# distribution plot of daily AQI
mean_aqi <- mean(DailyAQ_00_22$AQI)
ggplot(DailyAQ_00_22) +
geom_histogram(aes(x=AQI), na.rm=TRUE, alpha=0.5, color="black", fill='#2DD164', bins=50, binwidth=5) +
geom_vline(xintercept=mean_aqi, color="black", size=1, linetype=5, show.legend=FALSE) +
annotate("text", x=mean_aqi + 9, y=625, label=paste(round(mean_aqi, 2)), angle=0, size=4, color="black") +
labs(title="Distribution of Daily AQI Level", x="", y="Count")
In the histogram depicting the distribution of AQI, we can see that
the distribution is fairly normal. Although it is slightly right-skewed,
the number of data points suggests we can treat it as normal for our
modeling. The right-skewness is caused by days with unusually high AQI
values.
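To put a number on that skew, one could compute the sample skewness directly (a base-R sketch; the function name is ours):

```r
# Sample skewness: the mean cubed z-score. Values near 0 suggest symmetry;
# positive values indicate a right (upper) tail like the AQI histogram's.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}
```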
ggplot(DailyAQ_00_22_Yearly_Growth, aes(group=1)) +
geom_line(aes(x = year, y = aqi_diffRate), na.rm = T, stat="identity", color="#043008", size=1) +
geom_point(aes(x = year, y = aqi_diffRate), na.rm = TRUE, fill="#E6E930", shape = 23) +
labs(title="AQI year-over-year rate in NYC", x="Year", y="AQI") +
theme(
axis.title.y = element_text(color = "#043008", size = 13),
axis.title.y.right = element_text(color = "#E6E930", size = 13)
)

The year-over-year growth rate was also calculated from the yearly
average AQI and is depicted in the line plot above. We see an alternating
pattern between years, with variance that increases as we move toward
2022.
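The construction of DailyAQ_00_22_Yearly_Growth is not shown above; a sketch of how the yearly average and year-over-year rate might be computed (assuming DailyAQ_00_22 carries year and AQI columns, and taking aqi_diffRate to be the percent change in yearly mean AQI):

```r
library(dplyr)

# Sketch: yearly mean AQI plus its year-over-year percent change.
yearly_aqi_growth <- function(daily) {
  daily %>%
    group_by(year) %>%
    summarize(aqi_avg = mean(AQI, na.rm = TRUE), .groups = "drop") %>%
    arrange(year) %>%
    mutate(aqi_diffRate = 100 * (aqi_avg - lag(aqi_avg)) / lag(aqi_avg))
}
```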
In order to evaluate correlation between weather and air quality, we
combined our dataset with the NYC weather data based on the date value
in each. Dates without a matching air quality measurement are dropped.
Subsequent models will be built using this merged dataframe.
# merge data frame by date
DailyAQ_merged <- merge(DailyAQ_00_22, NYweath_00, by="DATE")
# select required columns
DailyAQ_merged <- DailyAQ_merged[ , c('DATE', 'year.x', 'month', 'AQI', 'TMAX', 'TMIN', 'PRCP', 'SNOW')]
colnames(DailyAQ_merged)[2] <- "year"
str(DailyAQ_merged)
xkablesummary(DailyAQ_merged)
The first step in building linear models is assessing correlation
between the numerical variables in the data. Because the year variable in
our dataset begins at 2000, using it directly would yield an intercept
far outside the range of the data. We therefore rescaled the variable to
start at 0 (and run to 22, representing 2022).
The correlation is evaluated via a pairs plot, which depicts the
correlation coefficient between numerical variables and scatterplots of
their relationships. The pairs plot uses the Pearson correlation
method.
# subset to numerical variables
DailyAQ_numerics <- DailyAQ_merged[ , c('AQI', 'TMAX', 'TMIN', 'PRCP', 'SNOW', 'year')]
DailyAQ_numerics$year <- DailyAQ_numerics$year - 2000
pairs.panels(DailyAQ_numerics,
method = "pearson", # correlation method
hist.col = "red", # set histogram color
density = TRUE, # show density plots
ellipses = TRUE, # show correlation ellipses
smoother = TRUE,
lm = TRUE,
main = "Pairs Plot Of Weather and Air Quality Numerical Variables",
cex.labels=0.75
)

From the Pearson pairs plot above, we can see a moderately strong
negative correlation between year and AQI. This indicates that as
the year increases, the AQI is dropping, resulting in better air
quality in the city.
To better observe the effects of year on AQI, we can visualize the
yearly average AQI.
# yearly average of daily AQI
ggplot(DailyAQ_00_22_Yearly_Growth) +
geom_line(aes(x = year, y = aqi_avg), stat="identity", color="#2DD164", size=1) +
geom_point(aes(x = year, y = aqi_avg), na.rm = TRUE, fill="#457108", shape = 21) +
labs(title="Average AQI by year in NYC", x="Year", y="AQI value")

The line plot confirms the correlation value we observed in the pairs
plot. The average yearly AQI is indeed decreasing as year increases. To
further test our observations, we will build a linear model using year
as a regressor to estimate daily AQI.
aqi_fit <- lm(AQI ~ year, data = DailyAQ_numerics)
summary(aqi_fit)
xkabledply(aqi_fit, title = paste("First Linear Model: ", format( formula(aqi_fit) )))
The results of our linear model reveal a significant value for both
the intercept and year coefficient. The coefficient value for the year
regressor, -1.775, indicates that for every one-year increase, the predicted
daily AQI decreases by about 1.78 points. This supports the correlation
coefficient we saw earlier between these two variables. The p-value of
the F-statistic is also significant, but the \(R^2\) value of the model is a measly 0.28.
Based on this model, year alone explains only 28% of the variability in
daily AQI measurements, which is not a strong result. Looking at the
scatterplot of the relationship helps explain the weak fit.
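As a check on that interpretation: in a one-regressor model, \(R^2\) is exactly the squared Pearson correlation between regressor and response, so the 0.28 figure matches the moderate negative correlation seen in the pairs plot. A small self-contained illustration on simulated data (not our AQI values):

```r
# For simple linear regression, R-squared equals the squared correlation
# between x and y; simulated noisy data stands in for year and AQI here.
set.seed(42)
x <- 1:200
y <- -1.8 * x + rnorm(200, sd = 150)
fit <- lm(y ~ x)
stopifnot(all.equal(summary(fit)$r.squared, cor(x, y)^2))
```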
ggplot(DailyAQ_00_22, aes(x = year, y = AQI)) +
geom_point(alpha = 0.5, color = "#2DD164", position = "jitter") +
labs(x = "Year", y = "AQI Value", title = "Daily AQI Values From 2000-2022 With Trend Line") +
geom_smooth(method = 'lm', formula = 'y ~ x', color = "black", fill="black")

As we can see, there is a high degree of noise when observing daily
AQI values at the yearly level. The plot displays a slight downward
trend in daily AQI, but the noise distorts the model fit. This helps
explain the results we received from our linear model.
Can we add more or different predictors to improve the fit? In our
first analysis, we looked at linear trends of TMAX over time and
determined a slight positive correlation observed over the years
1900-2022. We also utilized month as a categorical regressor to help
explain the variance in daily maximum temperatures. Based on those
results, we concluded that seasonality trends had a negative impact on
model performance. Perhaps seasonality also plays a part in daily
AQI measurements?
# NYC weather - Avg TMAX by month
NYweath_Monthly_Avg <- NYweath_00 %>%
group_by(month) %>%
summarize(avg_max_temp = mean(TMAX, na.rm=T),
avg_min_temp = mean(TMIN, na.rm=T))
ggplot(NYweath_Monthly_Avg, aes(x = as.numeric(month), y = avg_max_temp)) +
geom_line(color="#F21E1E", size = 2) +
geom_point(na.rm = TRUE, fill="#126BF4", shape = 21, size = 4) +
labs(title="Average TMAX By Month in NYC", x="Month", y="Temperature (°F)") +
scale_x_continuous(name = "Month",
breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))

To refresh our memories, we included the monthly average daily
maximum temperature. A seasonal trend can be observed as temperatures
increase during summer months and decrease during winter months.
DailyAQ_monthly <- DailyAQ_merged %>%
group_by(month) %>%
summarize(aqi_avg = mean(AQI, na.rm=T))
# monthly average AQI
ggplot(DailyAQ_monthly, aes(x = as.numeric(month), y = aqi_avg)) +
geom_line(color="#47ABE9", size = 2) +
geom_point(na.rm = TRUE, fill="#C10808", shape = 21, size = 4) +
labs(title="Average AQI By Month in NYC", x="Month", y="AQI") +
scale_x_continuous(name = "Month",
breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))

Plotting the average AQI by month, we can see seasonal trends: AQI
values are generally high during the winter and summer months but lower
in the months between.
Based on this, we can modify our last model and attempt to fit
seasonality by adding month as a categorical regressor,
along with our variable of interest from the last project, TMAX.
aqi_fit2 <- lm(AQI ~ TMAX + month, data = DailyAQ_merged)
summary(aqi_fit2)
xkabledply(aqi_fit2, title = paste("Second Linear Model: ", format( formula(aqi_fit2) )))
The regression coefficient for TMAX is significant and positive,
with each degree Fahrenheit increase raising the predicted AQI by
about 0.68 points. The regression coefficients for all
month categories are also significant. In fact, every month has a
negative effect on AQI when compared to January. September exhibits the
largest difference, with a predicted AQI almost 44 points lower than
January’s!
The p-value of the model’s F-statistic is also significant, allowing
us to reject the null hypothesis and conclude that there’s a significant
relationship between our chosen predictors and the daily AQI value.
However, the \(R^2\) for our model is
only 0.149, which is weaker than our previous model. This
indicates that only 14.9% of the variation in daily AQI can be explained
by TMAX and month.
# model VIF scores
xkablevif(aqi_fit2, title = "Model 2 VIF")
The VIF scores for all regressors are in an acceptable range; however,
the fit is still poor.
From our results, we can conclude that linear models provide an
inaccurate representation of the effects of seasonality within our
data.
It seems that due to seasonality, and the nature of time-series data,
we cannot properly model daily AQI using linear regression. Perhaps a
classification technique can be utilized to properly address the
seasonal trends. More precisely, we can build a kNN model to classify
the month based on daily AQI and maximum temperature values.
We start with plotting the relationship between our chosen predictors
and add a layer to discern month within the scatter.
ggplot(DailyAQ_merged) +
geom_point(aes(x=AQI, y=TMAX, color=month), alpha = 0.7) +
labs(title = "Daily Maximum Temperature vs Daily Air Quality Index Value Distinguished By Month",
x = "Daily AQI Value",
y = "Daily Maximum Temperature (F)") +
scale_color_brewer(palette = "Paired")

We can make out minimal distinction between months in the scatterplot
above, but the model will provide a more detailed analysis.
The kNN model requires us to center and scale our predictor values,
as they are recorded in vastly different units of measurement. We also
need to split our dataset into training and testing frames; we used a
4:1 split to satisfy this requirement.
# center and scale our data
DailyAQ_scaled <- as.data.frame(scale(DailyAQ_merged[4:5], center = TRUE, scale = TRUE))
str(DailyAQ_scaled)
# create train and test data sets with 4:1 splits
set.seed(1000)
DailyAQ_sample <- sample(2, nrow(DailyAQ_scaled), replace=TRUE, prob=c(0.80, 0.20))
DailyAQ_training <- DailyAQ_scaled[DailyAQ_sample == 1, ]
DailyAQ_test <- DailyAQ_scaled[DailyAQ_sample == 2, ]
# create label sets.
DailyAQ_trainLabels <- DailyAQ_merged[DailyAQ_sample == 1, 3]
DailyAQ_testLabels <- DailyAQ_merged[DailyAQ_sample == 2, 3]
nrow(DailyAQ_training)
nrow(DailyAQ_test)
evaluateK = function(k, train_set, val_set, train_class, val_class){
# Build knn with k neighbors considered.
set.seed(1000)
class_knn = knn(train = train_set, #<- training set cases
test = val_set, #<- test set cases
cl = train_class, #<- category for classification
k = k) #<- number of neighbors considered
tab = table(class_knn, val_class)
# Calculate the accuracy.
accu = sum(tab[row(tab) == col(tab)]) / sum(tab)
cbind(k = k, accuracy = accu)
}
# call evaluateK function for each odd k-value between 1 to 21
knn_different_k = sapply(seq(1, 21, by = 2),
function(x) evaluateK(x,
train_set = DailyAQ_training,
val_set = DailyAQ_test,
train_class = DailyAQ_trainLabels,
val_class = DailyAQ_testLabels))
# Reformat the results
knn_different_k = data.frame(k = knn_different_k[1,],
accuracy = knn_different_k[2,])
ggplot(knn_different_k, aes(x = k, y = accuracy)) +
geom_line(color = "orange", size = 1.5) +
geom_point(size = 3) +
labs(title = "kNN Model Accuracy vs k-value",
x = "Model k-value",
y = "Accuracy")

To find the best k-value for our model, we evaluated the model over a
range of k from 1 to 21. From the plot above, 13 nearest
neighbors seems a decent choice, since it provides the greatest
improvement in predictive accuracy before the incremental gains trail
off.
# set kval
kval <- 13
# build model
DailyAQ_pred <- FNN::knn(train = DailyAQ_training,
test = DailyAQ_test,
cl = DailyAQ_trainLabels,
k = kval)
# confusion matrix
DailyAQ_confusionMatrix <- caret::confusionMatrix(DailyAQ_pred, reference = DailyAQ_testLabels)
DailyAQ_pred_accuracy <- DailyAQ_confusionMatrix$overall['Accuracy']
xkabledply(as.matrix(DailyAQ_confusionMatrix), title = paste("ConfusionMatrix for k = ", kval))
xkabledply(data.frame(DailyAQ_confusionMatrix$byClass), title=paste("k = ", kval))
The overall accuracy of our model is a relatively weak value of
0.257. This indicates that AQI and TMAX were not good predictors of
month.
# multiclass ROC on test labels
knnROC <- multiclass.roc(DailyAQ_testLabels, as.integer(DailyAQ_pred))
knnROC
A multiclass ROC evaluation on the test labels yields an AUC value of
0.65, which is higher than expected based on the model’s accuracy value.
Still, this is not a significant result based on the AUC threshold of
0.8.
Local weather and local human social and economic activity
The last question our team set out to address is whether a relationship
exists between local weather and local human social and economic
activity. This relationship has been explored in the past, with
evidence suggesting weather affects an individual’s mood, thermal
comfort level, and social interaction, and can influence traffic, travel,
public health, crime rates, and even stock prices (Horanont, 2013). Our
team looked to expand on this prior work by making these observations at
the local level of New York City. We decided to look specifically at the
relationships between local weather and crime rates, stock prices, and
public health.
Two data sets were used to explore weather’s impact on crime rates.
First we looked at NYC shootings with 5,409 observations and then all
arrests with 5,844 observations. Both of these data sets were available
through NYC’s Open Data repository and had data from 2006-2021.
To interrogate weather’s impact on public health, our team looked at
COVID-19 case counts in New York City. The data was available from New
York City’s Open Data repository. The data set consisted of 991
observations between February 29th, 2020 and present day.
For stock market data, we looked at the total daily trade volume for
the Dow Jones Industrial Average. This data contained 10,822
observations dating back to December 25, 1979; however, trade volume
was only available from 1988 on.
To draw conclusions about weather’s relationship with human activity,
we use TMAX and a binary factor of total precipitation as the predictor
weather variables. We incorporated month and year into these
relationships as we saw seasonality effects and changes over time in our
previous effort.
# correlation between shooting & weather data from Emily
NYweathshoot_06_cor <- subset(NYweathshoot_06, select = c(year, day, TMAX, PRCP, SNOW, Shootings, Murders))
str(NYweathshoot_06_cor)
shootcor <- cor(NYweathshoot_06_cor, use = "pairwise.complete.obs")
corrplot::corrplot(shootcor)

#plot from Emily
ggplot(NYweathshoot_06) +
geom_histogram(aes(x=Shootings), na.rm=TRUE, alpha=0.5, color="black", bins=100, binwidth=2, fill = "green") +
labs(title="Distribution of daily NYC shootings (2006-2021)", x="Number of daily Shootings", y="Count")

#Not quite normal-- necessarily right-skewed, but a lot of data points...
#ERG
shoot_Month <- NYweathshoot_06 %>%
group_by(month) %>%
summarize("Shootings" = mean(Shootings, na.rm=T),
"Shooting murders" = mean(Murders, na.rm=T))
# side-by-side bar plot
shoot_Month %>%
dplyr::select(month, "Shootings", "Shooting murders") %>%
gather(key="Value", value="Count", "Shootings", "Shooting murders") %>%
ggplot(aes(x=month, y=Count, fill=Value)) +
geom_col(na.rm=TRUE, alpha=0.5, color="black", position="dodge") +
labs(title="Average NYC daily shootings (2006-2021) by month", x="Month", y="Number of shootings") +
scale_fill_manual(values=c("red", "blue"))

#making scatter plot of Shootings v TMAX-- ERG Add
ggplot(NYweathshoot_06, aes(x = TMAX, y = Shootings, color = month)) +
  geom_point() +
  labs(x = "TMAX (degrees Fahrenheit)",
       y = "# Shootings",
       title = "Number of NYC Shootings v. TMAX")

#modeling Shootings
#shoot_lm <- lm(formula = Shootings ~ year + month + day + SNOW + PRCP + TMAX + TMIN, data = NYweathshoot_06)
NYweathshoot_06$month <- as.factor(NYweathshoot_06$month)
#shoot_lm <- lm(formula = Shootings ~ year + month + SNOW + PRCP + TMAX, data = NYweathshoot_06)
shoot_lm <- lm(formula = Shootings ~ year + SNOW + PRCP + TMAX, data = NYweathshoot_06)
summary(shoot_lm)
# Add a column to convert PRCP to a binary factor variable. Don't care how much it rains, only if it rains.
NYweath_prcpFact <-NYweath_final
NYweath_prcpFact$PRCP_factor <- cut(NYweath_final$TOT_PRCP, c(-Inf,0, Inf), labels = c(0,1))
NYweath_prcpFact$PRCP_factor <- as.factor(NYweath_prcpFact$PRCP_factor)
Crime data
After looking at the relationship between reported shootings in NYC
and local weather, we expanded our comparison to all arrests, not just
shootings. This data was not previously analyzed by our group so we did
some basic EDA to determine the distribution of the new data and
identify correlations that exist.
NYcrime_count <- tibble(read.csv("./data/NYPD_Arrests_counts.csv"))
NYcrime_count$ARREST_DATE <- as.Date(NYcrime_count$ARREST_DATE, format = "%Y-%m-%d")
#NYcrime_count$day <- format(NYcrime_count$ARREST_DATE, format="%d")
#NYcrime_count$month <- format(NYcrime_count$ARREST_DATE, format="%m")
#NYcrime_count$year <- format(NYcrime_count$ARREST_DATE, format="%Y")
colnames(NYcrime_count)[2] <- "ARREST_DATE"
head(NYcrime_count)
summary(NYcrime_count)
#crime_plot <- plot(NYcrime_count$ARREST_DATE, NYcrime_count$NUM_ARREST)
#crime_boxplot <- boxplot(NYcrime_count$NUM_ARREST)
crime_hist <- ggplot(NYcrime_count, aes(x = NUM_ARREST)) +
  geom_histogram(fill = "cornflowerblue") +
  labs(x = "Number of Daily Arrests", y = "Frequency",
       title = "Distribution of Daily NYC Arrest Counts")
crime_hist

The histogram shows the distribution of the number of daily arrests
in NYC since 2006. There appears to be slight bimodality in the number
of daily arrests; however, because of the large sample size, we will
treat this as a normal distribution.
We incorporated the crime data with the weather data to look for any
additional correlations. We plotted arrest counts as the dependent
variable and temperature as the independent variable, with points coded
by precipitation and month. This plot is below. No initial patterns can
be discerned from this chart.
crimeWeather <- subset(NYweath_prcpFact, year >= 2006 & year < 2022)
NYcrime_count <- NYcrime_count[order(NYcrime_count$ARREST_DATE),]
tail(crimeWeather)
## DATE day month year TMAX TMIN TAVG PRCP SNOW TOT_PRCP PRCP_factor
## 55819 2021-12-26 26 12 2021 51 39 NA 0.00 0 0.00 0
## 55820 2021-12-27 27 12 2021 39 34 NA 0.09 0 0.09 1
## 55821 2021-12-28 28 12 2021 47 36 NA 0.05 0 0.05 1
## 55822 2021-12-29 29 12 2021 44 41 NA 0.14 0 0.14 1
## 55823 2021-12-30 30 12 2021 49 43 NA 0.05 0 0.05 1
## 55824 2021-12-31 31 12 2021 55 48 NA 0.00 0 0.00 0
NYweath_crime <- cbind(crimeWeather, NYcrime_count$NUM_ARREST)
colnames(NYweath_crime)[12] <- c("NUM_ARREST")
#NYweath_crime_plot <- plot(sqrt(NYweath_crime$PRCP), NYweath_crime$NUM_ARREST)
#boxplot((NYweath_crime$TOT_PRCP))
#NY_weathcrime_ggplot <- ggplot(NYweath_crime,
# aes(x = TMAX, y =NUM_ARREST)) +
#geom_point(aes(colour = PRCP_factor), alpha = 0.5) +
#labs(x = "Temperature (ºF)", y = "Number of Daily Arrests",
# title = "Weather Patterns for NYC Crime")
#NY_weathcrime_ggplot
NY_weathcrime_ggplot2 <- NYweath_crime %>%
sample_frac(0.25) %>%
ggplot(aes(x = TMAX, y =NUM_ARREST)) +
geom_point(aes(shape = PRCP_factor, colour = month)) +
labs(x = "Temperature (ºF)", y = "Number of Daily Arrests",
title = "Weather Patterns for NYC Crime") +
guides(colour = guide_legend(nrow = 6))
NY_weathcrime_ggplot2

crime_prcp1 <- subset(NYweath_crime, PRCP_factor == 1)
crime_prcp0 <- subset(NYweath_crime, PRCP_factor == 0)
crime_count_ttest <- t.test(crime_prcp0$NUM_ARREST, crime_prcp1$NUM_ARREST)
crime_count_ttest
Before looking at the correlation between all variables, we performed
a t-test to determine if there was a statistical difference in the
number of arrests on days with precipitation and days without. The
t-test found a statistically significant difference in the mean number
of arrests on days with and without precipitation, with a higher number
of arrests on days without precipitation.
We also looked at a correlation plot for weather variables and
arrests. The strongest correlation exists between number of arrests and
year, followed by month. Temperature appears to have no correlation with
crime while precipitation has a very minimal negative correlation with
crime. However, because we know temperature and precipitation have a
seasonal component to them, and the t-test showed precipitation had some
effect on arrests, these correlations might not tell the whole story. We
decided to create a linear regression model that incorporates all of
these variables.
crime_cor_data <- NYweath_crime
crime_cor_data$NUM_ARREST <- as.numeric(crime_cor_data$NUM_ARREST)
crime_cor_data <- cbind(crime_cor_data[3:6], crime_cor_data[8:12])
pairs.panels(crime_cor_data,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = FALSE, # show density plots
ellipses = FALSE, # show correlation ellipses
)

#crimeWeath_lm <- lm(NUM_ARREST ~ TMAX + PRCP_factor + year,
# data = NYweath_crime)
crimeWeathMonth_lm <- lm(NUM_ARREST ~ TMAX + PRCP_factor + year + month,
data = NYweath_crime)
#crimeWeathTMIN_lm <- lm(NUM_ARREST ~ (TMIN + PRCP_factor),
# data = NYweath_crime)
#summary(crimeWeathMonth_lm)
xkabledply(crimeWeathMonth_lm, title = paste("Linear Model: ", format(formula(crimeWeathMonth_lm) )))
We fit a linear model of arrests on the temperature, precipitation,
year, and month regressors. The model has an overall R^2 value of
0.427, which is a weak overall fit for the data, but does indicate
that these regressors explain some of the variability. The coefficients
for TMAX, the precipitation factor, year, and most months are
statistically significant.
There is a weak positive relationship between arrests and TMAX: for
every degree increase in temperature there are about 2.34 more arrests.
Precipitation had a slightly larger effect on the number of daily
arrests; the model shows a decrease of 46.13 arrests on days with
precipitation, confirming what we saw in our t-test. The model also
shows a decreasing number of arrests over time, and most months have a
statistically significant relationship with the number of daily
arrests.
Overall, we can see that there are statistically significant
relationships between weather and crime in NYC. This indicates that
local weather conditions have some effect on local human activity but we
still want to explore additional variables.
Stock Market Data
Next, we looked at the effect weather has on the stock market,
specifically the trade volume of the Dow Jones Industrial Average in
the 1990s. We looked at this period in particular due to the
transition to digital stock trading with the rise of computers. We
hypothesized that if weather had an effect on trade volume, the effect
would be greatest in this period given the data available.
To start investigating this relationship, initial EDA was performed.
Again, we looked at the distribution of the data and correlations that
exist between the variables.
NYstock <- tibble(read.csv("./data/Dow Jones Industrial Average Historical Data.csv"))
tail(NYstock)
NYstock$Date <- as.Date(NYstock$Date, format = "%m/%d/%y")
NYstock2 <- NYstock
NYstock2 <- NYstock2 %>%
complete(Date = seq.Date(min(Date), max(Date), by="day"))
options(scientific=T, digits = 10)
NYstock2$Vol. = substr(NYstock2$Vol., 1, nchar(NYstock2$Vol.) - 1)
NYstock2[2:5] <- lapply(NYstock2[2:5], gsub, pattern = ",", replacement = "")
NYstock2$Change.. <- gsub("%", "", NYstock2$Change..)
NYstock2[2:7] <- sapply(NYstock2[2:7], as.numeric)
NYstock2$day <- format(NYstock2$Date, format="%d")
NYstock2$month <- format(NYstock2$Date, format="%m")
NYstock2$year <- format(NYstock2$Date, format="%Y")
#head(NYstock2)
#summary(NYstock2)
options(scientific=T, digits = 3)
NYstock_final <- NYstock2[,c("Date", "Vol.")]
NYstock_final <- subset(NYstock_final, Date <= "2022-09-26")
weather_stockDates <- subset(NYweath_prcpFact, DATE >= "1979-12-25")
stockWeather <- cbind(weather_stockDates, NYstock_final)
colnames(stockWeather)[13] <- c("Volume")
stockWeather_rmNA <- subset(stockWeather, !is.na(Volume))
stock_hist <- hist(stockWeather_rmNA$Volume)

stockWeather_rmNA_gghist <- ggplot(stockWeather_rmNA,
                                   aes(x = Volume)) +
  geom_histogram(fill = "cornflowerblue") +
  labs(x = "Total Daily Trade Volume (M)", y = "Frequency",
       title = "Distribution of Daily DOW Trade Volume")
stockWeather_rmNA_gghist

The histogram of the daily DOW trade volume shows the data is
right-skewed. We plan to use this data in a linear model, so we will
apply a square-root transformation to the volume and re-examine the
distribution.
stockWeather_rmNA$Volume_norm <- sqrt(stockWeather_rmNA$Volume)
stockWeather_rmNA <- subset(stockWeather_rmNA, select = -c(Date))
stockWeather_90s <- subset(stockWeather_rmNA, year >= 1988 & year <= 1999)
#hist(stockWeather_90s$Volume_norm)
#boxplot(stockWeather_rmNA$Volume_norm)
stockWeather90Norm_rmNA_gghist <- ggplot(stockWeather_90s,
                                         aes(x = Volume_norm)) +
  geom_histogram(fill = "cornflowerblue") +
  labs(x = "Normalized Daily Trade Volume (M)", y = "Frequency",
       title = "Distribution of Normalized DOW Trade Volume")
stockWeather90Norm_rmNA_gghist

The distribution of the square-root-transformed volume is considerably
closer to normal, although some right skew remains. Given the number of
observations, we will again treat this as normal enough for further
analysis.
stock_prcp1 <- subset(stockWeather_90s, PRCP_factor == 1)
stock_prcp0 <- subset(stockWeather_90s, PRCP_factor == 0)
stock_count_ttest <- t.test(stock_prcp0$Volume_norm,
stock_prcp1$Volume_norm)
stock_count_ttest
A t-test was performed to compare the mean normalized stock trade
volume on days with and without precipitation. The t-test failed to
reject the null hypothesis, as the means are nearly identical regardless
of precipitation. We then wanted to look at the correlations among all
the variables.
pairs.panels(stockWeather_90s[c("year", "month", "TMAX",
"TOT_PRCP","PRCP_factor",
"Volume","Volume_norm")],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = FALSE, # show density plots
ellipses = FALSE # show correlation ellipses
)

The only strong correlation in this data set is between year and stock
trade volume; no other strong correlations are present. This appears to
be supported by the scatter plot of stock volume vs. TMAX, with points
encoded by month and PRCP factor.
#NY_weathstock_scatter <- ggplot(stockWeather_rmNA, aes(x = TMAX, y #=Volume_norm)) +
# geom_point(aes(colour = PRCP_factor)) +
# labs(x = "Temperature", y = "Total Daily DOW Trade Volume",
# title = "Weather Patterns for DOW Trade Volume")
#NY_weathstock_scatter
## Trying again with the 90's stock data.
#NY_90s_weathstock_scatter <- ggplot(stockWeather_90s, aes(x = TMAX, y =Volume_norm)) +
# geom_point(aes(colour = PRCP_factor)) +
#labs(x = "Temperature (ºF)", y = "Normalized Daily DOW Trade Volume (M)",
# title = "Weather Patterns for DOW Trade Volume in the 1990s")
#NY_90s_weathstock_scatter
NY_90s_weathstock_scatter2 <- stockWeather_90s %>%
sample_frac(0.3) %>%
ggplot(aes(x = TMAX, y =Volume_norm)) +
geom_point(aes(colour = month, shape = PRCP_factor)) +
labs(x = "Temperature (ºF)", y = "Normalized Daily DOW Trade Volume (M)",
title = "Weather Patterns for DOW Trade Volume in the 1990s") +
guides(colour = guide_legend(nrow = 6))
NY_90s_weathstock_scatter2
With no noticeable correlations between weather and stock volume, a
linear model was created to highlight potential hidden relationships
that may exist in the data.
#stock_LM <- lm(Volume_norm ~ TMAX + PRCP_factor + year + month,
# stockWeather_rmNA)
#summary(stock_LM)
stock_LM_90s <- lm(Volume_norm ~ TMAX + PRCP_factor + year + month,
stockWeather_90s)
#summary(stock_LM_90s)
xkabledply(stock_LM_90s, title = paste("Linear Model: ", format(formula(stock_LM_90s) )))
Linear Model: Volume_norm ~ TMAX + PRCP_factor + year + month

               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -861.9944     11.8722  -72.606    0.0000
TMAX             0.0052      0.0023    2.236    0.0254
PRCP_factor1    -0.0352      0.0437   -0.805    0.4209
year             0.4352      0.0060   73.099    0.0000
month02         -0.1506      0.1032   -1.458    0.1448
month03         -0.2522      0.1017   -2.479    0.0132
month04         -0.1396      0.1124   -1.242    0.2142
month05         -0.3995      0.1215   -3.288    0.0010
month06         -0.4460      0.1349   -3.306    0.0010
month07         -0.4749      0.1435   -3.308    0.0009
month08         -0.4178      0.1391   -3.003    0.0027
month09         -0.2517      0.1285   -1.958    0.0503
month10          0.0814      0.1135    0.717    0.4734
month11          0.0488      0.1058    0.461    0.6446
month12         -0.0297      0.1007   -0.295    0.7678
The linear model, fit to the stock data from 1988-1999, had
statistically significant coefficients for TMAX, year, and several of
the months. The relationship between TMAX and trade volume is misleading
because of the difference in scale between the variables: trade volume
is on the order of millions, so even a statistically significant
coefficient does not represent a meaningful correlation. The TMAX effect
is more likely an artifact of the seasonality shown by the statistically
significant month coefficients, with the warmer months more significant.
The overall model has an r-squared value of 0.641, but that is likely
driven primarily by the year variable, as indicated by the strong
correlation above.
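One way to check that year drives most of the fit is to compare the r-squared of a year-only model against the full model. The sketch below uses simulated data with a strong year trend and a negligible temperature effect; the variable names are illustrative stand-ins, not the project's data:

```r
# Sketch: when the response trends strongly with year, a year-only model
# recovers nearly all of the full model's r-squared. Simulated data.
set.seed(1)
n    <- 1000
year <- sample(1988:1999, n, replace = TRUE)
tmax <- rnorm(n, mean = 60, sd = 15)
vol  <- 0.4 * year + 0.005 * tmax + rnorm(n)   # year dominates

full      <- lm(vol ~ year + tmax)
year_only <- lm(vol ~ year)

summary(full)$r.squared
summary(year_only)$r.squared  # nearly as large as the full model's
```

On the real data this comparison would confirm whether the weather terms add any explanatory power beyond the time trend.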
Overall, the relationships between stock trade volume and weather, most
notably TMAX, were very weak and not strong enough to be predictive.
There was also no statistically significant relationship between stock
volume and precipitation.
COVID Data
The last human activity we wanted to explore was public health by
looking at a relationship between weather and COVID-19 case counts. In
our previous effort, we saw that COVID-19 lockdown had an effect on the
temperature in NYC so here we are looking for weather’s effect on the
number of cases in the city.
To explore this relationship, we used NYC's Open Data to pull daily
COVID-19 case counts dating back to February 29, 2020. One notable flaw
in this data set is that case counts were tracked by confirmation date
rather than test date, which means the weather data does not always line
up with the case-count dates. We continued the analysis under the
assumption that same-day tests would outnumber longer-turnaround tests,
so that any relationships that exist would still be identifiable.
This was new data for our team, so we performed initial EDA followed
by a linear model to observe relationships and predict COVID-19 case
counts.
options(scientific=T, digits = 3)
NYcovid <- tibble(read.csv("./data/COVID-19_Daily_Counts_of_Cases__Hospitalizations__and_Deaths.csv"))
NYcovid <- select(NYcovid, 1:3)
head(NYcovid)
colnames(NYcovid)[1] <- "DATE"
NYcovid$DATE <- as.Date(NYcovid$DATE, format = "%m/%d/%Y")
NYcovid$day <- format(NYcovid$DATE, format="%d")
NYcovid$month <- format(NYcovid$DATE, format="%m")
NYcovid$year <- format(NYcovid$DATE, format="%Y")
head(NYcovid)
summary(NYcovid)
hist(NYcovid$CASE_COUNT)

covid_plot <- plot(NYcovid$DATE, NYcovid$CASE_COUNT)

covid_boxplot <- boxplot(NYcovid$CASE_COUNT)

NYcovid_rmOuts <- outlierKD2(NYcovid, CASE_COUNT, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

NYcovid_rmOuts2 <- outlierKD2(NYcovid_rmOuts, CASE_COUNT, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

NYcovid_rmOuts3 <- outlierKD2(NYcovid_rmOuts2, CASE_COUNT, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

NYcovid_rmOuts4 <- outlierKD2(NYcovid_rmOuts3, CASE_COUNT, rm=TRUE, boxplt=TRUE, histogram=TRUE, qqplt=TRUE)

covid_plot <- plot(NYcovid_rmOuts4$DATE, NYcovid_rmOuts4$CASE_COUNT)
tail(NYcovid_rmOuts4)
sqrt_count <- sqrt(NYcovid_rmOuts4$CASE_COUNT)
#hist(sqrt_count)
NYcovid_final <- cbind(NYcovid_rmOuts4, sqrt_count)
head(NYcovid_final)

cov_final_hist <- ggplot(NYcovid_final, aes(x = sqrt_count)) +
geom_histogram(fill = "cornflowerblue") +
labs(x = "Normalized COVID Case Count", y = "Frequency",
title = "Distribution of Normalized COVID Case Count")
cov_final_hist

We examined the normality of the COVID count data and saw the counts
were extremely right-skewed and contained many outliers. After removing
the outliers, the data were still right-skewed, though less extreme, so
we used a square-root transform to normalize the data.
covWeather <- subset(NYweath_prcpFact, DATE >= ("2020-02-29"))
NYcovid_finaldates <- subset(NYcovid_final, DATE <= "2022-09-26")
tail(covWeather)
NYweath_prcpCov <- cbind(covWeather, NYcovid_finaldates$CASE_COUNT,
NYcovid_finaldates$sqrt_count)
colnames(NYweath_prcpCov)[12:13] <- c("CASE_COUNT", "sqrt_count")
covCount_prcp_plot <- plot(NYweath_prcpCov$sqrt_count, sqrt(NYweath_prcpCov$PRCP))

NYweath_cov_final <- NYweath_prcpCov[,c(1:5, 8, 10:13)]
A t-test comparing COVID-19 case counts on days with precipitation
and days without precipitation was then performed. Similar to what was
seen in the stock market t-test, there was not a statistically
significant difference in the mean number of COVID-19 case counts on
days with and without precipitation. Boxplots of the precipitation and
no precipitation COVID case counts are below.
cov_prcp1 <- subset(NYweath_cov_final, PRCP_factor == 1)
cov_prcp0 <- subset(NYweath_cov_final, PRCP_factor == 0)
cov_count_ttest <- t.test(cov_prcp0$sqrt_count, cov_prcp1$sqrt_count)
cov_count_ttest
cov_count_bplot <- ggplot()+
geom_boxplot(data = NYweath_cov_final,
aes(y = sqrt_count, x = PRCP_factor, fill = PRCP_factor)) +
labs(title = "COVID Positive Counts")
cov_count_bplot

## Repeating this EDA looking only at Covid cases from 2021+.
cov_2021 <- subset(NYweath_cov_final, year >= 2021)
cov_2021count_bplot <- ggplot()+
geom_boxplot(data = cov_2021, aes(y = sqrt_count, x = PRCP_factor)) +
labs(title = "2021 COVID Positive Counts")
cov_2021count_bplot

cov_2021prcp1 <- subset(cov_2021, PRCP_factor == 1)
cov_2021prcp0 <- subset(cov_2021, PRCP_factor == 0)
cov_2021count_ttest <- t.test(cov_2021prcp0$sqrt_count, cov_2021prcp1$sqrt_count)
cov_2021count_ttest
Following the t-test, we wanted to look at the correlations of all
the variables that can be included in a linear model.
library(psych)
pairs.panels(NYweath_cov_final[4:10],
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = FALSE, # show density plots
ellipses = FALSE # show correlation ellipses
)

The strongest correlation existed between year and the COVID-19 case
counts, followed by a negative correlation between daily temperature and
case counts. There was no correlation between precipitation and COVID
case counts.
These variables were plotted in the scatter plot below to visualize
the relationships prior to building the linear model.
covWeath_final_scatter <- ggplot(NYweath_cov_final,
aes(x = TMAX,
y =sqrt_count,
)) +
geom_point(aes(colour = month, shape = PRCP_factor)) +
labs(x = "Temperature",
y = "Normalized COVID-19 Case Counts",
title = "Weather Patterns for Covid Case Counts") +
guides(colour = guide_legend(nrow = 6))
covWeath_final_scatter
Using this EDA, we built a linear model that predicts the normalized
COVID-19 case counts using TMAX, precipitation as a factor, year, and
month as regressors. Year and month were included in the linear model
because we have to account for seasonal changes in both weather and
COVID-19.
#cov_weathLM <- lm(sqrt_count ~ TMAX + PRCP_factor + year,
# data = NYweath_cov_final)
#summary(cov_weathLM)
cov_weathLM_month <- lm(sqrt_count ~ TMAX + PRCP_factor + year + month,
data = NYweath_cov_final)
summary(cov_weathLM_month)
##
## Call:
## lm(formula = sqrt_count ~ TMAX + PRCP_factor + year + month,
## data = NYweath_cov_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.67 -8.91 -0.68 8.14 41.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.56e+04 1.22e+03 -12.74 < 2e-16 ***
## TMAX -5.53e-02 5.57e-02 -0.99 0.3210
## PRCP_factor1 -6.30e-01 9.33e-01 -0.68 0.4994
## year 7.74e+00 6.05e-01 12.79 < 2e-16 ***
## month02 -1.87e+01 3.03e+00 -6.16 1.1e-09 ***
## month03 -1.70e+01 2.99e+00 -5.67 1.9e-08 ***
## month04 -9.47e+00 3.15e+00 -3.01 0.0027 **
## month05 -1.91e+01 3.41e+00 -5.62 2.6e-08 ***
## month06 -2.63e+01 3.76e+00 -7.00 5.0e-12 ***
## month07 -2.15e+01 3.94e+00 -5.46 6.2e-08 ***
## month08 -2.07e+01 3.89e+00 -5.32 1.3e-07 ***
## month09 -2.32e+01 3.62e+00 -6.39 2.6e-10 ***
## month10 -2.54e+01 3.43e+00 -7.40 3.3e-13 ***
## month11 -1.66e+01 3.22e+00 -5.15 3.2e-07 ***
## month12 2.09e+00 3.31e+00 0.63 0.5289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.2 on 862 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.358, Adjusted R-squared: 0.347
## F-statistic: 34.3 on 14 and 862 DF, p-value: <2e-16
xkabledply(cov_weathLM_month, title = paste("Linear Model: ", format(formula(cov_weathLM_month) )))
Linear Model: sqrt_count ~ TMAX + PRCP_factor + year + month

               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -1.56e+04    1.22e+03  -12.739    0.0000
TMAX          -5.53e-02    5.57e-02   -0.993    0.3210
PRCP_factor1  -6.30e-01    9.33e-01   -0.676    0.4994
year           7.74e+00    6.05e-01   12.788    0.0000
month02       -1.87e+01    3.03e+00   -6.159    0.0000
month03       -1.70e+01    2.99e+00   -5.674    0.0000
month04       -9.47e+00    3.15e+00   -3.010    0.0027
month05       -1.91e+01    3.41e+00   -5.618    0.0000
month06       -2.63e+01    3.76e+00   -7.004    0.0000
month07       -2.15e+01    3.94e+00   -5.461    0.0000
month08       -2.07e+01    3.89e+00   -5.318    0.0000
month09       -2.32e+01    3.62e+00   -6.395    0.0000
month10       -2.54e+01    3.43e+00   -7.396    0.0000
month11       -1.66e+01    3.22e+00   -5.150    0.0000
month12        2.09e+00    3.31e+00    0.630    0.5289
#cov2021_weathLM <- lm(CASE_COUNT ~ TMAX + PRCP_factor + year + month,
# data = cov_2021)
#summary(cov2021_weathLM)
The linear model predicts COVID-19 cases in New York City with
weather, year, and month as regressors. It has an adjusted R-squared of
0.347, indicating a weak fit, though the model does explain some of the
variation in the number of cases. However, not all of the coefficients
are significant: neither TMAX nor the precipitation factor is
statistically significant, indicating that they are not predictive of
COVID case counts. Instead, year and most months are statistically
significant, and they drive the fit and predictive ability of the
model.
Overall, this model indicates that weather, via TMAX and the
precipitation factor, does not have a relationship with the number of
COVID-19 cases in NYC. However, the major assumption we made about
COVID-19 testing versus confirmation dates likely affected the results,
which makes the relationship between COVID-19 case counts and weather a
topic worth pursuing in future study.
LR to predict Precipitation!
The last analysis performed does not directly address our original
questions, but our team looked to use the relationships between human
activity and weather to predict precipitation as a binary factor.
Our team built a logistic regression to predict precipitation using
TMAX, year, month, and human activity via the number of arrests and
stock volume. These two variables were chosen for human activity because
the previous linear models showed they had some statistically
significant relationship to weather. We wanted to explore whether they
would be predictive of precipitation when combined into one model.
stock_masterDates <- subset(NYstock_final,Date >= "2006-01-01" &
Date <= "2021-12-31")
crime_masterDates <- NYcrime_count
weath_masterDates <- subset(NYweath_prcpFact, year >= 2006 & year < 2022)
master_log <- cbind(weath_masterDates,
crime_masterDates$NUM_ARREST,
stock_masterDates$Vol.)
colnames(master_log)[12:13] <- c('NUM_ARREST', 'Volume')
head(master_log)
master_logFinal <- subset(master_log, !is.na(Volume))
master_logFinal$Volume_norm <- sqrt(master_logFinal$Volume)
prcp_logit <- glm(PRCP_factor ~ TMAX + NUM_ARREST +
Volume_norm + year + month,
data = master_logFinal,
family = binomial(link = "logit"))
#summary(prcp_logit)
xkabledply(prcp_logit, title = paste("Logistic Model: ", format(formula(prcp_logit) )))
Logistic Model: PRCP_factor ~ TMAX + NUM_ARREST + Volume_norm + year + month

              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)    72.9312     20.4760    3.562    0.0004
TMAX           -0.0057      0.0038   -1.503    0.1327
NUM_ARREST     -0.0009      0.0001   -6.669    0.0000
Volume_norm    -0.0103      0.0089   -1.151    0.2497
year           -0.0359      0.0101   -3.546    0.0004
month02         0.2426      0.1681    1.443    0.1489
month03         0.1495      0.1693    0.883    0.3772
month04         0.4001      0.1854    2.158    0.0309
month05         0.3266      0.2052    1.592    0.1114
month06         0.5036      0.2230    2.259    0.0239
month07         0.3447      0.2418    1.425    0.1541
month08         0.2012      0.2363    0.851    0.3946
month09        -0.0494      0.2227   -0.222    0.8244
month10         0.0741      0.1933    0.384    0.7013
month11        -0.0123      0.1787   -0.069    0.9450
month12         0.0470      0.1695    0.277    0.7815
The logistic model predicts the log-odds of precipitation; however,
most of the coefficients in the model are not statistically significant.
Year, two of the months, and the number of arrests are statistically
significant. For every one-unit increase in year, the odds of
precipitation decrease by a factor of exp(-0.036), roughly a 3.5% drop
per year, which is not a particularly strong effect on the odds. For
every additional arrest, the odds of precipitation decrease by a factor
of exp(-0.0009), about 0.999. This again is not a strong effect, but
given the scale of the number of arrests per day, it is a more
interesting relationship.
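The odds interpretation comes from exponentiating the logit coefficients. The snippet below recomputes the odds ratios from the estimates in the table; in practice one would call `exp(coef(prcp_logit))` on the fitted model:

```r
# Odds ratios from the logistic coefficients reported above.
coefs <- c(year = -0.0359, NUM_ARREST = -0.0009)
odds_ratios <- exp(coefs)
odds_ratios
# year:       ~0.965, i.e. odds of precipitation fall ~3.5% per year
# NUM_ARREST: ~0.999, i.e. ~0.09% lower odds per additional arrest
```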
library(ModelMetrics)
prcpLR_cm <- confusionMatrix(actual = prcp_logit$y,
predicted = prcp_logit$fitted.values)
prcpLR_cm
prcpLR_acc <- (prcpLR_cm[2,2] + prcpLR_cm[1,1])/(sum(prcpLR_cm))
prcpLR_prec <- (prcpLR_cm[2,2])/(prcpLR_cm[2,2]+prcpLR_cm[1,2])
prcpLR_rec <- (prcpLR_cm[2,2])/(prcpLR_cm[2,2]+prcpLR_cm[2,1])
prcpLR_spec <- (prcpLR_cm[1,1])/(prcpLR_cm[1,1]+prcpLR_cm[1,2])
library(pscl)
prcp_pr2 <- pR2(prcp_logit)
prcp_pr2
Unfortunately, this logistic regression is not a good model for
predicting precipitation. The overall accuracy of the model, using a
default cutoff of 0.5, is 0.64, which is not a great improvement over a
null model. Additionally, the precision of the model is only 0.03 and
the recall is only 0.51.
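The accuracy, precision, and recall above come from the 2x2 confusion matrix indexed in the code. As a worked example with a small hypothetical matrix (the counts are illustrative, not the model's):

```r
# Worked example: classification metrics from a 2x2 confusion matrix,
# rows = actual class, columns = predicted class. Counts are made up.
cm <- matrix(c(500, 250,
                30,  32),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- (cm[1, 1] + cm[2, 2]) / sum(cm)     # ~0.655
precision <- cm[2, 2] / (cm[2, 2] + cm[1, 2])    # ~0.113
recall    <- cm[2, 2] / (cm[2, 2] + cm[2, 1])    # ~0.516
c(accuracy = accuracy, precision = precision, recall = recall)
```

A low precision like the model's 0.03 means most predicted rain days were actually dry, even when accuracy looks acceptable.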
The McFadden pseudo-R-squared for this model is 0.01, which shows the
model explains almost none of the variability in precipitation.
We also plotted the receiver operating characteristic (ROC) curve to
measure the model's true positive rate against its false positive rate.
library(pROC)
master_logFinal$prob=predict(prcp_logit, type = c("response"))
prcp_roc <- roc(PRCP_factor ~ prob, data = master_logFinal)
prcp_auc <- pROC::auc(prcp_roc)
prcp_auc
## Area under the curve: 0.575
plot(prcp_roc)

The area under the ROC curve is 0.575. This confirms the model is not
a good fit as it is only slightly greater than the null AUC of 0.5.
Overall, this logistic regression was not a good model for predicting
whether or not precipitation occurred based on human activity. The
accuracy of the model and the overall fit were both too poor to be
useful in any predictive context. However, the logistic regression does
appear to confirm a relationship between the number of arrests and
precipitation, which would be of interest to explore further in future
study.
Summary of Key Findings
We found statistically significant correlations in models using
weather, air quality, and human activity data from NYC, but none of our
models demonstrate high predictive potential.
Challenges with the data: the data were not ideal for our initial
hypotheses, so we can neither reject nor accept them. We would need
better data.
We cannot model seasonality with linear regressions; seasonality-adjusted
time-series models would potentially yield better outcomes.
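One simple seasonality-aware alternative, sketched below on simulated data, is to add harmonic (sine/cosine) terms of day-of-year to a linear model, which captures a smooth annual cycle without twelve month dummies. This is an illustration of the idea, not an analysis of the project's data:

```r
# Sketch: harmonic regression to capture annual seasonality.
# Simulated daily series with a yearly cycle plus a weak linear trend.
set.seed(7)
day <- 1:(365 * 4)
doy <- day %% 365
y   <- 0.002 * day + 10 * sin(2 * pi * doy / 365) + rnorm(length(day), sd = 2)

# sin/cos pairs at the annual frequency act as smooth "season" regressors
fit <- lm(y ~ day + sin(2 * pi * doy / 365) + cos(2 * pi * doy / 365))
summary(fit)$r.squared  # high: the harmonic terms absorb the cycle
```

Dedicated time-series tools (e.g. STL decomposition or seasonal ARIMA) would go further by also modeling autocorrelation in the residuals.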
What data would enable us to answer the question if we had more
time?
Conclusions
- What were the limitations of our methods?
- Data availability issues: not having all the data we wanted to
explore.
- Brief model evaluation and comparison
Tejas’ conclusion
Our hypothesis that a correlation would exist between daily weather
and air quality variables was ultimately not supported. We observed
trends of declining AQI over time, but the variance explained by our
model was not high enough to deem it a good fit. Similarly, a linear
model predicting AQI from the categorical month variable, along with
TMAX, also resulted in a poor fit.
We determined that the relationship between air quality and global
warming is difficult to model with linear techniques due to seasonal
trends in the variables. Our attempt to model the effect of these trends
using kNN also resulted in a poorly fitted model. Ultimately, a
different type of model would be required to address the seasonal
component.
Also, changes in climate are slow to take effect: an increase in
emissions does not necessarily lead to an increase in temperature on the
same time scale. All these effects would need to be taken into
consideration for an effective analysis.
Chris's Conclusion:
Our hypothesis that a relationship exists between daily weather and
local human activity, via crime, stock market, and public health data,
was supported in some areas. Both crime and stock market trade volume
had statistically significant correlations with daily weather variables
in our linear models: crime is correlated with both temperature and
precipitation, while stock market trade volume is related to
temperature. However, neither of these correlations is strong enough to
be predictive.
The relationship between crime and weather was the strongest in this
analysis, and it represents a valuable area for future study in the
context of the changing weather patterns explored earlier in our
project.
There were notable limitations to the methods in this analysis. One
key limitation, which affected the public health analysis, was the
availability of essential data. The COVID-19 case data were based on the
dates when positive cases were confirmed, rather than when people were
tested. Because test and confirmation dates are not always the same,
this limited our ability to assess relationships that existed on the day
of an individual's test. Looking for alternative data sources to explore
this relationship would be an interesting area for a future project.
These are only some of the important questions that should be asked
about the relationship between humans and weather. As climate change
continues, it is important to understand how it may affect individuals
at every level, from the global scale down to the local level.
References (APA Style)
Lindsey, R., & Dahlman, L. (2022, June 28). Climate change: Global
temperature. Climate.gov. Retrieved December 11, 2022, from https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature
United States Environmental Protection Agency. Air data basic
information. Retrieved December 8, 2022, from
https://www.epa.gov/outdoor-air-quality-data/air-data-basic-information
NYC Environment and Health Data Portal. (2021, April 12). Tracking
changes in New York City's sources of air pollution. Retrieved December
6, 2022, from https://a816-dohbesp.nyc.gov/IndicatorPublic/beta/data-stories/aq-cooking/
Horanont, T., Phithakkitnukoon, S., Leong, T. W., Sekimoto, Y., &
Shibasaki, R. (2013). Weather effects on the patterns of people's
everyday activities: A study using GPS traces of mobile phone users.
PLoS ONE, 8(12), e81153. https://doi.org/10.1371/journal.pone.0081153